{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
    "\n",
    "<i>Licensed under the MIT License.</i>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LSTUR: Neural News Recommendation with Long- and Short-term User Representations\n",
    "LSTUR \\[1\\] is a news recommendation approach capturing users' both long-term preferences and short-term interests. The core of LSTUR is a news encoder and a user encoder.  In the news encoder, we learn representations of news from their titles. In user encoder, we propose to learn long-term\n",
    "user representations from the embeddings of their IDs. In addition, we propose to learn short-term user representations from their recently browsed news via GRU network. Besides, we propose two methods to combine\n",
    "long-term and short-term user representations. The first one is using the long-term user representation to initialize the hidden state of the GRU network in short-term user representation. The second one is concatenating both\n",
    "long- and short-term user representations as a unified user vector.\n",
    "\n",
    "## Properties of LSTUR:\n",
    "- LSTUR captures users' both long-term and short term preference.\n",
    "- It uses embeddings of users' IDs to learn long-term user representations.\n",
    "- It uses users' recently browsed news via GRU network to learn short-term user representations.\n",
    "\n",
    "## Data format:\n",
    "\n",
    "### train data\n",
    "One simple example: <br>\n",
    "\n",
    "`1 0 0 0 0 Impression:0 User:2903 CandidateNews0:27006,11901,21668,9856,16156,21390,1741,2003,16983,8164 CandidateNews1:8377,10423,9960,5485,20494,7553,1251,17232,4745,9178 CandidateNews2:1607,26414,25830,16156,15337,16461,4004,6230,17841,10704 CandidateNews3:17323,20324,27855,16156,2934,14673,551,0,0,0 CandidateNews4:7172,3596,25442,21596,26195,4745,17988,16461,1741,76 ClickedNews0:11362,8205,22501,9349,12911,20324,1238,11362,26422,19185 ...`\n",
    "<br>\n",
    "\n",
    "In general, each line in data file represents one positive instance and n negative instances in a same impression. The format is like: <br>\n",
    "\n",
    "`[label0] ... [labeln] [Impression:i] [User:u] [CandidateNews0:w1,w2,w3,...] ... [CandidateNewsn:w1,w2,w3,...] [ClickedNews0:w1,w2,w3,...] ...`\n",
    "\n",
    "<br>\n",
    "\n",
    "It contains several parts seperated by space, i.e. label part, Impression part `<impresison id>`, User part `<user id>`, CandidateNews part, ClickedHistory part. CandidateNews part describes the target news article we are going to score in this instance, it is represented by (aligned) title words. To take a quick example, a news title may be : `Trump to deliver State of the Union address next week` , then the title words value may be `CandidateNewsi:34,45,334,23,12,987,3456,111,456,432`. <br>\n",
    "ClickedNewsk describe the k-th news article the user ever clicked and the format is the same as candidate news. Words are aligned in news title. We use a fixed length to describe an article, if the title is less than the fixed length, just pad it with zeros.\n",
    "\n",
    "### test data\n",
    "One simple example: <br>\n",
    "`1 Impression:0 User:6446 CandidateNews0:18707,23848,13490,10948,21385,11606,1251,16591,827,28081 ClickedNews0:27838,7376,16567,28518,119,21248,7598,9349,20324,9349 ClickedNews1:7969,9783,1741,2549,27104,14669,14777,21343,7667,20324 ...`\n",
    "<br>\n",
    "\n",
    "In general, each line in data file represents one instance. The format is like: <br>\n",
    "\n",
    "`[label] [Impression:i] [User:u] [CandidateNews0:w1,w2,w3,...] [ClickedNews0:w1,w2,w3,...] ...`\n",
    "<br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Global settings and imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System version: 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21) \n",
      "[GCC 7.3.0]\n",
      "Tensorflow version: 1.15.2\n"
     ]
    }
   ],
   "source": [
    "import sys\n",
    "sys.path.append(\"../../\")\n",
    "import os\n",
    "from reco_utils.recommender.deeprec.deeprec_utils import download_deeprec_resources \n",
    "from reco_utils.recommender.newsrec.newsrec_utils import prepare_hparams\n",
    "from reco_utils.recommender.newsrec.models.lstur import LSTURModel\n",
    "from reco_utils.recommender.newsrec.io.news_iterator import NewsIterator\n",
    "import papermill as pm\n",
    "from tempfile import TemporaryDirectory\n",
    "import tensorflow as tf\n",
    "\n",
    "print(\"System version: {}\".format(sys.version))\n",
    "print(\"Tensorflow version: {}\".format(tf.__version__))\n",
    "\n",
    "tmpdir = TemporaryDirectory()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download and load data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 21.2k/21.2k [00:01<00:00, 13.7kKB/s]\n"
     ]
    }
   ],
   "source": [
    "data_path = tmpdir.name\n",
    "yaml_file = os.path.join(data_path, r'lstur.yaml')\n",
    "train_file = os.path.join(data_path, r'train.txt')\n",
    "valid_file = os.path.join(data_path, r'test.txt')\n",
    "wordEmb_file = os.path.join(data_path, r'embedding.npy')\n",
    "if not os.path.exists(yaml_file):\n",
    "    download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/newsrec/', data_path, 'lstur.zip')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create hyper-parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "epochs=6\n",
    "seed=40"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:\n",
      "The TensorFlow contrib module will not be included in TensorFlow 2.0.\n",
      "For more information, please see:\n",
      "  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md\n",
      "  * https://github.com/tensorflow/addons\n",
      "  * https://github.com/tensorflow/io (for I/O related ops)\n",
      "If you depend on functionality not listed there, please file an issue.\n",
      "\n",
      "data_format=news,iterator_type=None,wordEmb_file=/tmp/tmptei5l16e/embedding.npy,doc_size=10,title_size=None,body_size=None,word_emb_dim=100,word_size=28929,user_num=10338,vert_num=None,subvert_num=None,his_size=50,npratio=4,dropout=0.2,attention_hidden_dim=200,head_num=4,head_dim=100,cnn_activation=relu,dense_activation=None,filter_num=400,window_size=3,vert_emb_dim=100,subvert_emb_dim=100,gru_unit=400,type=ini,user_emb_dim=50,learning_rate=0.0001,loss=cross_entropy_loss,optimizer=adam,epochs=6,batch_size=64,show_step=100000,metrics=['group_auc', 'mean_mrr', 'ndcg@5;10']\n"
     ]
    }
   ],
   "source": [
    "hparams = prepare_hparams(yaml_file, wordEmb_file=wordEmb_file, epochs=epochs)\n",
    "print(hparams)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "iterator = NewsIterator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the LSTUR model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:From ../../reco_utils/recommender/newsrec/models/base_model.py:39: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.\n",
      "\n",
      "WARNING:tensorflow:From /data/anaconda/envs/reco_gpu/lib/python3.6/site-packages/tensorflow_core/python/keras/initializers.py:119: calling RandomUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
      "Instructions for updating:\n",
      "Call initializer instance with the dtype argument instead of passing it to the constructor\n",
      "WARNING:tensorflow:From /data/anaconda/envs/reco_gpu/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.\n",
      "Instructions for updating:\n",
      "If using Keras pass *_constraint arguments to layers.\n",
      "WARNING:tensorflow:From /data/anaconda/envs/reco_gpu/lib/python3.6/site-packages/tensorflow_core/python/keras/backend.py:3994: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.\n",
      "Instructions for updating:\n",
      "Use tf.where in 2.0, which has the same broadcast rule as np.where\n",
      "WARNING:tensorflow:From ../../reco_utils/recommender/newsrec/models/base_model.py:54: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "model = LSTURModel(hparams, iterator, seed=seed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:From ../../reco_utils/recommender/newsrec/io/news_iterator.py:95: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.\n",
      "\n",
      "{'group_auc': 0.5353, 'mean_mrr': 0.1703, 'ndcg@5': 0.173, 'ndcg@10': 0.2252}\n"
     ]
    }
   ],
   "source": [
    "print(model.run_eval(valid_file))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "at epoch 1\n",
      "train info: logloss loss:1.6012276304011441\n",
      "eval info: group_auc:0.5527, mean_mrr:0.1762, ndcg@10:0.2392, ndcg@5:0.171\n",
      "at epoch 1 , train time: 26.1 eval time: 13.2\n",
      "at epoch 2\n",
      "train info: logloss loss:1.5758061764191609\n",
      "eval info: group_auc:0.5631, mean_mrr:0.1751, ndcg@10:0.2363, ndcg@5:0.1696\n",
      "at epoch 2 , train time: 23.8 eval time: 13.1\n",
      "at epoch 3\n",
      "train info: logloss loss:1.5535941887874993\n",
      "eval info: group_auc:0.5573, mean_mrr:0.1749, ndcg@10:0.2373, ndcg@5:0.1774\n",
      "at epoch 3 , train time: 23.8 eval time: 13.1\n",
      "at epoch 4\n",
      "train info: logloss loss:1.5352852480752128\n",
      "eval info: group_auc:0.575, mean_mrr:0.1789, ndcg@10:0.2419, ndcg@5:0.184\n",
      "at epoch 4 , train time: 23.8 eval time: 13.1\n",
      "at epoch 5\n",
      "train info: logloss loss:1.5134395093333965\n",
      "eval info: group_auc:0.5817, mean_mrr:0.1884, ndcg@10:0.2515, ndcg@5:0.1903\n",
      "at epoch 5 , train time: 23.8 eval time: 13.1\n",
      "at epoch 6\n",
      "train info: logloss loss:1.4875014397562767\n",
      "eval info: group_auc:0.5788, mean_mrr:0.1931, ndcg@10:0.2571, ndcg@5:0.1931\n",
      "at epoch 6 , train time: 23.6 eval time: 13.2\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<reco_utils.recommender.newsrec.models.lstur.LSTURModel at 0x7f7054567668>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.fit(train_file, valid_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'group_auc': 0.5788, 'mean_mrr': 0.1931, 'ndcg@5': 0.1931, 'ndcg@10': 0.2571}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/data/anaconda/envs/reco_gpu/lib/python3.6/site-packages/ipykernel_launcher.py:3: DeprecationWarning: Function record is deprecated and will be removed in verison 1.0.0 (current version 0.19.1). Please see `scrapbook.glue` (nteract-scrapbook) as a replacement for this functionality.\n",
      "  This is separate from the ipykernel package so we can avoid doing imports until\n"
     ]
    },
    {
     "data": {
      "application/papermill.record+json": {
       "res_syn": {
        "group_auc": 0.5788,
        "mean_mrr": 0.1931,
        "ndcg@10": 0.2571,
        "ndcg@5": 0.1931
       }
      }
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "res_syn = model.run_eval(valid_file)\n",
    "print(res_syn)\n",
    "pm.record(\"res_syn\", res_syn)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reference\n",
    "\\[1\\] Mingxiao An, Fangzhao Wu, Chuhan Wu, Kun Zhang, Zheng Liu and Xing Xie: Neural News Recommendation with Long- and Short-term User Representations, ACL 2019<br>"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python (reco_gpu)\n",
   "language": "python",
   "name": "reco_gpu"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
